Abstract. We study the convergence properties of a pair of learning
algorithms (learning with and without memory). This leads us to study
the dominant eigenvalue of a class of random matrices. This turns out
to be related to the roots of the derivative of random polynomials
(generated by picking their roots uniformly at random in the interval
[0, 1], although our results extend to other distributions). This, in
turn, requires the study of the statistical behavior of the harmonic
mean of random variables as above, which leads us to delicate
questions of the rate of convergence to stable laws and tail estimates
for stable laws. The reader can find the proofs of most of the results
announced here in [KR2001a].



The original motivation for the work in this paper was provided
by the first-named author's research in learning theory, specifically
in various models of language acquisition (see [KNN2001], [NKN2001],
[KN2001]) and more specifically yet by the analysis of the speed of
convergence of the memoryless learner algorithm. Curiously, our methods
also result in a complete analysis of learning with full memory, as shown
in some detail in section 3.2. The setup is described in section 3, so
here we will just recall the essentials. There is a collection of concepts
$R_1, \dots, R_n$ and words which refer to these concepts, sometimes
ambiguously. The teacher generates a stream of words, referring to the
concept $R_1$. This is not known to the student, but he must learn by,
at each step, guessing some concept $R_i$ and checking for consistency
with the teacher's input. The memoryless learner algorithm consists of
picking a concept $R_i$ at random, and sticking by this choice, until it is
proven wrong. At this point another concept is picked randomly, and
the procedure repeats. Learning with full memory follows the same
general process with the important difference that once a concept is
rejected, the student never goes back to it.^1 It is clear that once the
student hits on the right answer $R_1$, this will be his final answer, so
the question is then:

How quickly do the two methods converge to the truth? 

Since the first method is memoryless, as the name implies, it is clear
that the learning process is a Markov process, and as is well known the
convergence rate is determined by the gap between the top
(Perron-Frobenius) eigenvalue and the second largest eigenvalue.
However, we are also interested in a kind of generic behavior, so we
assume that the sizes of overlaps between concepts are random, with some
(sufficiently regular) probability density function supported in [0, 1],
and that the number of concepts is large. This makes the transition
matrix random, though of a certain restricted kind, as described in
detail in section 3.1. The analysis of convergence speed then comes down
to a detailed analysis of the size of the second-largest eigenvalue and
also of the properties of the eigenspace decomposition. The analysis for
learning with full memory is quite different, but the results have a very
similar form. We summarize below:

Theorem 0.1. Let $N_\Delta$ be the number of steps it takes for the student
to have probability $1 - \Delta$ of learning the concept. Then we have the
following estimates for $N_\Delta$:

• if the distribution of overlaps is uniform, or more generally, the
density function $f(1 - x)$ at $0$ has the form $f(x) = c + O(x^{\delta})$,
$\delta, c > 0$, then there exist positive constants $C_1, C_2, C_1', C_2'$
such that

$$\lim_{n \to \infty} P\left( C_1 < \frac{N_\Delta}{|\log \Delta|\, n \log n} < C_2 \right) = 1$$

for the memoryless algorithm and

$$\lim_{n \to \infty} P\left( C_1' < \frac{N_\Delta}{(1 - \Delta)^2\, n \log n} < C_2' \right) = 1$$

when learning with full memory;

• if the probability density function $f(1 - x)$ is asymptotic to
$c x^{\beta} + O(x^{\beta + \delta})$, $\delta, \beta > 0$, as $x$ approaches
$0$, then for the two algorithms we have respectively

$$\lim_{n \to \infty} P\left( c_1 < \frac{N_\Delta}{|\log \Delta|\, n} < c_2 \right) = 1$$

and

$$\lim_{n \to \infty} P\left( c_1' < \frac{N_\Delta}{(1 - \Delta)^2\, n} < c_2' \right) = 1$$

for some positive constants $c_1, c_2, c_1', c_2'$;

• if the asymptotic behavior is as above, but $-1 < \beta < 0$, then

$$\lim_{x \to \infty} \lim_{n \to \infty} P\left( \frac{1}{x} < \frac{N_\Delta}{|\log \Delta|\, n^{1/(1+\beta)}} < x \right) = 1$$

for the memoryless learning algorithm, and similarly

$$\lim_{x \to \infty} \lim_{n \to \infty} P\left( \frac{1}{x} < \frac{N_\Delta}{(1 - \Delta)^2\, n^{1/(1+\beta)}} < x \right) = 1$$

for learning with full memory.

^1 Another important learning algorithm is the so-called batch learner.
This is analysed completely in [R2001].

It should be said that our methods give quite precise estimates on 
the constants in the asymptotic estimate, but the rate of convergence 
is rather poor - logarithmic - so these precise bounds are of limited 
practical importance. 

1. Eigenvalues and polynomials 

In order to calculate the convergence rate of the learning algorithm 
described above, we need to study the spectrum of a class of random 
matrices. The matrices have the following form: 

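$$T_{ij} = \begin{cases} a_i, & i = j, \\ \dfrac{1 - a_i}{n - 1}, & \text{otherwise,} \end{cases} \tag{1}$$

where

$$a_1 = 1, \qquad 0 \le a_i < 1, \quad 2 \le i \le n. \tag{2}$$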

Let $B = \frac{n-1}{n}(I - T)$, so that the eigenvalues of $T$, $\lambda_i$,
are related to the eigenvalues of $B$, $\mu_i$, by
$\lambda_i = 1 - [n/(n-1)]\,\mu_i$. We show the following amusing

Lemma 1.1. Let $p(x) = (x - x_1) \cdots (x - x_n)$, where $x_i = 1 - a_i$. Then
the characteristic polynomial $p_B$ of $B$ satisfies:

$$p_B(x) = \frac{x}{n} \frac{dp(x)}{dx}.$$

From Lemma 1.1, the second largest eigenvalue of the matrix $T$, $\lambda_*$,
and the smallest root of $p'(x)$, which we denote $\mu_*$, are related as

$$\lambda_* = 1 - \frac{n}{n-1}\,\mu_*. \tag{3}$$
Therefore, we need to study the distribution of the smallest root of
$p'(x)$, given that the smallest root of $p(x)$ is fixed at $0$. Letting the
roots of $p$ be $0 = x_1 < x_2 \le \dots \le x_n$, and letting

$$H(x_2, \dots, x_n) = \frac{n - 1}{\sum_{i=2}^{n} \frac{1}{x_i}} \tag{4}$$

be the harmonic mean of the nontrivial roots of $p(x)$, we have

Theorem 1.2. The smallest root $\mu_*$ of $p'(x)$ satisfies:

$$\frac{1}{2} H(x_2, \dots, x_n) \le (n - 1)\,\mu_* \le H(x_2, \dots, x_n). \tag{5}$$

We can see that the study of the distribution of $\mu_*$ entails the study
of the asymptotic behavior of the harmonic mean of a sample from a
distribution on [0, 1].
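
As a numerical sanity check of Lemma 1.1, relation (3) and Theorem 1.2,
the following sketch builds $T$ for a small, arbitrarily chosen vector of
overlaps and compares the spectrum of $B$ with the roots of $x\,p'(x)$.
The sample values and the function names are illustrative assumptions,
not part of the paper.

```python
import numpy as np

def transition_matrix(a):
    """Build T with T[i, i] = a[i] and T[i, j] = (1 - a[i]) / (n - 1) for j != i."""
    a = np.asarray(a, dtype=float)
    n = a.size
    T = np.repeat(((1.0 - a) / (n - 1))[:, None], n, axis=1)
    np.fill_diagonal(T, a)
    return T

def spectral_check(a):
    """Return eigenvalues of B, roots of x*p'(x), mu_*, lambda_* and H."""
    a = np.asarray(a, dtype=float)
    n = a.size
    T = transition_matrix(a)
    B = (n - 1) / n * (np.eye(n) - T)
    x = 1.0 - a                                    # roots of p, with x[0] = 0
    crit = np.sort(np.roots(np.polyder(np.poly(x))).real)   # roots of p'
    eig_B = np.sort(np.linalg.eigvals(B).real)
    lam_star = np.sort(np.linalg.eigvals(T).real)[-2]       # second largest
    H = (n - 1) / np.sum(1.0 / x[1:])              # harmonic mean of x_2..x_n
    return eig_B, np.concatenate(([0.0], crit)), crit[0], lam_star, H

a = [1.0, 0.2, 0.5, 0.7, 0.9]    # hypothetical overlaps, a[0] = 1 as in (2)
eig_B, roots, mu_star, lam_star, H = spectral_check(a)
n = len(a)
print(np.allclose(eig_B, roots))                         # Lemma 1.1
print(np.isclose(lam_star, 1 - n / (n - 1) * mu_star))   # relation (3)
print(0.5 * H <= (n - 1) * mu_star <= H)                 # Theorem 1.2
```

The check is only a spot test on one instance; it says nothing about the
random-matrix asymptotics studied in the rest of the paper.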



2. Statistics of the harmonic mean. 

The arithmetic, harmonic, and geometric means are examples of the
"conjugate means", given by

$$m_{\mathcal{F}}(x_1, \dots, x_n) = \mathcal{F}^{-1}\left( \frac{1}{n} \sum_{i=1}^{n} \mathcal{F}(x_i) \right),$$

where $\mathcal{F}(x) = x$ for the arithmetic mean, $\mathcal{F}(x) = \log(x)$
for the geometric mean, and $\mathcal{F}(x) = 1/x$ for the harmonic mean. The
interesting situation is when $\mathcal{F}$ has a singularity in the support
of the distribution of $x$, and this case seems to have been studied very
little, if at all. Here we will devote ourselves to the study of the
harmonic mean.
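
A minimal sketch of the three means as instances of one conjugate mean,
assuming the usual form "apply $\mathcal{F}$, average, apply
$\mathcal{F}^{-1}$". The function name `conjugate_mean` and the sample
values are hypothetical.

```python
import math

def conjugate_mean(xs, F, F_inv):
    """Return F_inv applied to the average of F(x) over xs."""
    return F_inv(sum(F(x) for x in xs) / len(xs))

xs = [1.0, 2.0, 4.0]
arithmetic = conjugate_mean(xs, lambda x: x, lambda y: y)
geometric = conjugate_mean(xs, math.log, math.exp)
harmonic = conjugate_mean(xs, lambda x: 1.0 / x, lambda y: 1.0 / y)
print(arithmetic, geometric, harmonic)
```

For the harmonic mean the map $1/x$ blows up at $0$, which is exactly the
singular situation singled out in the text when the sample can come
arbitrarily close to $0$.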

Given $x_1, \dots, x_n$, a sequence of independent, identically
distributed variables in [0, 1] (with common probability density
function $f$), the nonlinear nature of the harmonic mean leads us to
consider first the random variable

$$X_n = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{x_i}. \tag{6}$$

Since the variables $1/x_i$ are easily seen to have infinite expectation and
variance, our prospects seem grim at first blush, but then we notice
that the variable $1/x_i$ falls straight into the framework of the "stable
laws" of Lévy and Khintchine, which is briefly presented below.
2.1. Stable limit laws. Consider an infinite sequence of independent
identically distributed random variables $y_1, \dots, y_n, \dots$, with
probability distribution function $\mathfrak{F}$. Typical questions studied
in probability theory are the following.

Let $S_n = \sum_{j=1}^{n} y_j$. How is $S_n$ distributed? What can we say
about the distribution of $S_n$ as $n \to \infty$?

The best known example is one covered by the Central Limit Theorem:
if $\mathfrak{F}$ has finite mean $\mu$ and variance $\sigma^2$, then
$(S_n - n\mu)/(\sqrt{n}\,\sigma)$ converges in distribution to the normal
distribution ([Norris1940]). Similarly, we say that the variable $y$
belongs to the domain of attraction of a non-singular distribution $G$,
if there are constants $a_1, \dots, a_n, \dots$ and $b_1, \dots, b_n, \dots$
such that the sequence of variables $Y_k = a_k S_k - b_k$ converges in
distribution to $G$. It was shown by Lévy and by Khintchine that having a
domain of attraction constitutes severe restrictions on the distribution
as well as the norming sequences $\{a_k\}$ and $\{b_k\}$. To wit, one can
always pick $a_k = k^{-1/\alpha}\, l(k)$, $0 < \alpha \le 2$, where $l(k)$ is a
slowly varying function (in the sense of Karamata). In that case, $G$ is
called a stable distribution of exponent $\alpha$. If the variable $y$
belongs to the domain of a stable distribution of exponent $\alpha > 1$,
then $y$ has an expectation $\mu$; just as in the case $\alpha = 2$, we can
choose $b_k = k^{1 - 1/\alpha} \mu$. When $\alpha < 1$, the variable $y$ has no
mean, and it turns out that we can take $b_k = 0$; for $\alpha = 1$, we can
take $b_n = c \log n$, where $c$ is some constant depending on
$\mathfrak{F}$. In particular, the normal distribution is a stable
distribution of exponent 2 (and is unique, up to scale and shift).
This is one of the few cases where we have an explicit expression for the
density of a stable distribution. The Fourier transforms of the densities
are explicitly known; the reader can find them in [FellerV2, Chapter
XVII]. The stable distributions of a given exponent are parameterized
by parameters $p$, $q$, $C$ defined below:

(7) lim , I'^^^l , = Cp, 

(8) lim = Cg, 

x^oo 1 — ^[X) 

and $p + q = 1$. We will say that the stable law is unbalanced if $p = 1$
or $q = 1$ above. This will happen if the support of the variable $y$ is
positive - this will be the only case we will consider in the sequel. Note
that this does not mean that the stable distribution is supported away
from $-\infty$, though that is true for exponents smaller than 1.

2.2. Limiting distribution of the harmonic mean. Which particular
stable law comes up in the study of the variable $X_n$ in (6) depends
on the density function $f(x)$. Let us assume that

$$f(x) \asymp c\, x^{\beta},$$

as $x \to 0$ (for the uniform distribution $\beta = 0$, $c = 1$). (The
notation $b \asymp a$ means that $a$ is asymptotically the same as $b$, i.e.
there exist constants $c_1, c_2, d_1, d_2$, so that
$c_1 b + d_1 \le a \le c_2 b + d_2$.) Then we have

Theorem 2.1. If $\beta = 0$, then let $Y_n = X_n - \log n$. The variables
$Y_n$ converge in distribution to the variable $Y$ distributed in accordance
to the unbalanced stable law $G$ with $\alpha = 1$. If $\beta > 0$, then $X_n$
converges in distribution to $\delta(x - \mu)$, where $\mu = E(1/x_i)$ (since
the $x_i$ are identically distributed the value of the index $i$ is not
relevant). If $-1 < \beta < 0$, then $n^{1 - 1/(1+\beta)} X_n$ converges in
distribution to a variable $X$ distributed in accordance to the stable law
with exponent $\alpha = 1 + \beta$.

Remark. In the case when the variables $x_1, \dots, x_n$ have positive and
continuous density at $0$, the variables $X_n$ above converge to the Cauchy
distribution (the symmetric stable distribution of exponent 1). This
is the content of exercise 7.6 in [Durrett91], though the (necessary)
condition of positivity of the density at $0$ is inadvertently omitted
there.

Theorem 2.1 points us in the right direction, since it allows us
to guess the form of the following results ($H_n$ is the harmonic mean of
the variables):

Theorem 2.2. Let $H_n = 1/X_n$ and $\beta = 0$. Then there exists a
constant $\mathcal{C}_1$ such that

$$\lim_{n \to \infty} E(H_n \log n) = \mathcal{C}_1.$$

Theorem 2.3. Suppose $\beta > 0$, let $y = 1/x$, and let $\mu$ be the mean of
the variable $y$. Then $\lim_{n \to \infty} E(\mu H_n) = 1$.

Finally,

Theorem 2.4. Suppose $\beta < 0$. Then there exists a constant
$\mathcal{C}_2$ such that
$\lim_{n \to \infty} E\left(H_n / n^{1 - 1/(1+\beta)}\right) = \mathcal{C}_2$.

We also have the following laws of large numbers:

Theorem 2.5. Laws of large numbers for harmonic mean. Let
$\beta = 0$ and let $a > 0$. Then

$$\lim_{n \to \infty} P\left( |H_n \log n - \mathcal{C}_1| > a \right) = 0.$$

If $\beta > 0$, and $\mu$ is as in the statement of Theorem 2.3, then

$$\lim_{n \to \infty} P\left( \left| H_n - \frac{1}{\mu} \right| > a \right) = 0.$$
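
The two laws of large numbers can be illustrated by a small Monte Carlo
experiment. The sketch below is not from the paper: the sample sizes, the
seed, the choice $\beta = 2$ and the helper names are assumptions made for
illustration, and the density is taken to be exactly $(\beta+1)x^{\beta}$ on
$(0, 1]$. For the uniform case it uses $\mathcal{C}_1 = 1/c = 1$.

```python
import numpy as np

def harmonic_mean(x):
    """H_n = n / sum(1/x_i), i.e. 1 / X_n in the notation of (6)."""
    return x.size / np.sum(1.0 / x)

def sample(n, beta, rng):
    """Draw n points with density (beta + 1) * x**beta on (0, 1] by inverse transform."""
    u = 1.0 - rng.random(n)           # uniform on (0, 1], avoids division by zero
    return u ** (1.0 / (beta + 1.0))

rng = np.random.default_rng(0)        # fixed seed, purely for reproducibility
n, trials = 10_000, 200

# beta = 0 (uniform): H_n * log(n) should settle near 1, but only slowly.
uniform_case = np.array([harmonic_mean(sample(n, 0.0, rng)) * np.log(n)
                         for _ in range(trials)])

# beta = 2: mu = E(1/x) = (beta + 1) / beta = 1.5, so H_n should be near 2/3.
beta2_case = np.array([harmonic_mean(sample(n, 2.0, rng)) for _ in range(trials)])

print(np.median(uniform_case), np.median(beta2_case))
```

In the uniform case the correction to the limit is of relative size
$1/\log n$, so the printed median is expected to sit visibly below 1 even
for large $n$, whereas the $\beta = 2$ case concentrates tightly.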

The proofs of the above results use a variety of estimates; the reader
is referred to [KR2001a]. In addition to the laws of large numbers, we
also have the following limiting distribution results:

Theorem 2.6. For $\alpha = 1$, the random variable
$\log n\,(H_n \log n - \mathcal{C}_1)$ converges in distribution to a variable
with the distribution function $1 - G(-x/\mathcal{C}_1)$, where $G$ is the
limiting distribution (of exponent $\alpha = 1$) of variables
$Y_n = X_n - c \log n$ and $\mathcal{C}_1 = 1/c$.

Theorem 2.7. For $\alpha > 1$, the random variable
$n^{1 - 1/\alpha}(H_n - 1/\mu)$ converges in distribution to a variable with
the distribution function $1 - G(-x \mu^2)$, where $G$ is the unbalanced
stable law of exponent $\alpha$.

Theorem 2.8. For $0 < \alpha < 1$, the random variable
$H_n / n^{1 - 1/\alpha}$ converges in distribution to a variable with the
distribution function $1 - G(1/x)$, where $G$ is the distribution function
of the unbalanced stable law of exponent $\alpha$.

3. A PAIR OF LEARNING ALGORITHMS 

3.1. The memoryless learner algorithm. Suppose there are $n$
intersecting sets, $R_1, \dots, R_n$, and $n$ probability measures,
$\nu_1, \dots, \nu_n$, each defined on its set (so that $\nu_i(R_i) = 1$).
The similarity matrix $A$ is given by $a_{ij} = \nu_i(R_j)$. It follows that
$0 \le a_{ij} \le 1$ and $a_{ii} = 1$ for all $i$ and $j$.

Let us consider a typical problem of learning theory. A teacher
generates a sequence of points which belong to one of these sets, say to
set $R_1$. The total length of the sequence is $N$. The learner's task is to
guess what set is the teacher's set after receiving $N$ points. For
simplicity we assume here that $a_{ij} < 1$ for $i \ne j$, which means that
no set is a subset of another set. Many different algorithms are available
to the learner, one given by the so-called memoryless learner algorithm
[Niyogi1998], a favorite with learning theorists. It works in the
following way. The learner starts by (randomly) choosing one of the $n$
sets as an initial state. Then $N$ sample points are received from the
teacher. For each sampling, the learner checks if the point belongs to its
current set. If it does, no action is taken; otherwise, the learner
randomly picks a different set. The initial probability distribution of
the learner is uniform: $p^{(0)} = (1/n, \dots, 1/n)^T$, i.e. each of the
sets has the same chance to be picked at the initial moment. The discrete
time evolution of the vector $p^{(t)}$ is a Markov process with transition
matrix $T$, which depends on the similarity matrix, $A$. The transition
matrix is given by Eqs. (1), (2) with $a_i = \nu_1(R_i)$.

After $N$ samplings, the probability of learning the correct set is given
by $Q_{11} = [(p^{(0)})^T T^N]_1$. It is clear that the convergence rate of
the memoryless algorithm can be determined if we study properties of the
matrix $T$. We are interested in the rate of convergence as a function of
$n$, the number of possible sets.

We define the convergence rate of the method as the difference
$1 - Q_{11}$. In order to evaluate the convergence rate of the memoryless
learner algorithm, let us represent the matrix $T$ as $T = V \Lambda W$, where
the diagonal matrix $\Lambda$ consists of the eigenvalues of $T$, which we
call $\lambda_i$, $1 \le i \le n$; the columns of the matrix $V$ are the right
eigenvectors of $T$, $v_i$, and the rows of the matrix $W$ are the left
eigenvectors of $T$, $w_i$, normalized to satisfy
$\langle w_i, v_j \rangle = \delta_{ij}$ (so that $VW = WV = I$).
The eigenvalues of $T$ satisfy $|\lambda_i| \le 1$. We have

$$T^N = V \Lambda^N W.$$

Let us arrange the eigenvalues so that $\lambda_1 = 1$ and
$\lambda_2 \equiv \lambda_*$ is the second largest eigenvalue. If $N$ is
large, we have $\lambda_i^N \ll \lambda_*^N$ for all $i \ge 3$, so only the
first two largest eigenvalues need to be taken into account. This means
that in order to evaluate $Q_{11}$ we only need the following
eigenvectors: $v_1 = (1/n, \dots, 1/n)^T$, $v_2$, $w_1 = (n, 0, 0, \dots, 0)$,
and $w_2$. The result is:

$$Q_{11} = 1 - C \lambda_*^N, \tag{9}$$

where $C = -\sum_{j=1}^{n} [v_2]_j [w_2]_1 / n$. It follows, therefore, that
the convergence rate of the memoryless learner algorithm can be estimated
if we estimate $\lambda_*$ and $C$. It turns out that once we understand
$\lambda_*$, we can also estimate $C$.
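
The two-eigenvalue approximation (9) can be compared with the exact value
of $Q_{11}$ on a small example. The sketch below is illustrative only: the
overlap values and function names are assumptions, and the left
eigenvectors are obtained as the rows of $V^{-1}$, which enforces the
normalization $WV = I$ automatically.

```python
import numpy as np

def transition_matrix(a):
    """T[i, i] = a[i], T[i, j] = (1 - a[i]) / (n - 1) for j != i, as in (1)."""
    a = np.asarray(a, dtype=float)
    n = a.size
    T = np.repeat(((1.0 - a) / (n - 1))[:, None], n, axis=1)
    np.fill_diagonal(T, a)
    return T

def q11_exact(a, N):
    """Probability of sitting in the correct (first) set after N samples."""
    T = transition_matrix(a)
    p0 = np.full(len(a), 1.0 / len(a))
    return (p0 @ np.linalg.matrix_power(T, N))[0]

def q11_two_term(a, N):
    """Approximation 1 - C * lambda_*^N using only the two leading eigenvalues."""
    T = transition_matrix(a)
    n = T.shape[0]
    lam, V = np.linalg.eig(T)         # columns of V: right eigenvectors
    W = np.linalg.inv(V)              # rows of W: left eigenvectors, W @ V = I
    k = np.argsort(lam.real)[-2]      # index of the second largest eigenvalue
    C = -np.sum(V[:, k]).real * W[k, 0].real / n
    return 1.0 - C * lam[k].real ** N, C, lam[k].real

a = [1.0, 0.2, 0.5, 0.7, 0.9]         # hypothetical overlaps, a[0] = 1
approx, C, lam_star = q11_two_term(a, 200)
exact = q11_exact(a, 200)
print(exact, approx, C, lam_star)
```

Because the product $[v_2]_j [w_2]_1$ is invariant under rescaling of the
eigenvector pair, the value of $C$ does not depend on how the numerical
routine happens to normalize $V$.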

Our results can be summarized as follows. For large $n$, the quantity
$C$ is bounded from above and below by some constants. From formulas
(3) and (9) we can see that in order for the learner to pick up the correct
set with probability $1 - \Delta$, we need to have at least

$$N_\Delta \sim |\log \Delta| / \mu_* \tag{10}$$

sampling events (Theorem 2.5 tells us that $\mu_* = o(1/n)$, and so we
have the right to replace $\log(1 - \mu_*)$ by $-\mu_*$). Using the
relationship between $\mu_*$ and the harmonic mean and our results for
$H_n$ from Theorem 2.5, we obtain the following estimate:

$$N_\Delta \sim |\log \Delta|\, h(n), \tag{11}$$

where $h(n)$ is $n \log n$ if the overlaps are uniformly distributed (in
other words, the entries $a_{1i}$ of the similarity matrix, as random
variables, are uniformly distributed in [0, 1]), and $h(n)$ is $n$ if the
density of overlaps at 1 goes to 0. Estimate (11) should be understood in
the sense that the right hand side of (10) converges in probability to the
right hand side of (11). If the density grows at 1 as $(1 - x)^{\beta}$,
$-1 < \beta < 0$, then

$$\lim_{x \to \infty} \lim_{n \to \infty} P\left( \frac{1}{x} < \frac{N_\Delta}{|\log \Delta|\, n^{1/(1+\beta)}} < x \right) = 1.$$

3.2. A better algorithm. Consider the following improvement on 
the previous learning algorithm: the student keeps a list of the sets 
he has not rejected, and when the time comes to switch, he picks uni- 
formly among those sets only. It is clear that this algorithm ("learning 
with full memory" ) should perform better than the memoryless learner 
algorithm described in the last section, but how much better? 

Since the analysis is quite simple, we present it here. There are
two questions which need to be answered (we always assume that the
correct answer is the first set, $G_1$):

Question 1. Suppose the student has picked the set $G_i$,
$i \ne 1$. What is the expected number of turns before he is
forced to reject $G_i$ and jump to a different set?

Question 2. What is the probability that the student will 
change his mind exactly k times before guessing the right 
answer? 

We answer the second question first, by 

Lemma 3.1. The probability that the set $G_1$ is encountered on the $k$-th
turn is independent of $k$ (and so equals $1/n$).

Proof. Suppose the student starts by picking a set $G_{i_1}$ at random, and
then keeps picking sets $G_{i_2}, G_{i_3}, \dots, G_{i_n}$, until there are
none left, and making sure never to repeat a set. The sequence
$i_1, \dots, i_n$ is a permutation of the sequence $1, \dots, n$, and it is
clear (for reasons of symmetry) that every permutation is equally likely.
Since for any $k$, precisely $(n - 1)!$ permutations have 1 in the $k$-th
position, the lemma is proved. □
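
The counting step in the proof can be checked by brute force for a small
number of sets. This is only an illustrative enumeration; the function
name and the choice $n = 5$ are assumptions.

```python
from itertools import permutations
from collections import Counter

def position_counts(n):
    """Count, over all orderings of n sets, the turn on which set 1 appears."""
    counts = Counter()
    for order in permutations(range(1, n + 1)):
        counts[order.index(1) + 1] += 1
    return counts

counts = position_counts(5)
print(dict(counts))
```

Each turn $k$ receives the same count, namely $(n-1)!$, so the probability
of meeting the correct set on turn $k$ is $(n-1)!/n! = 1/n$.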

Question 1 is also easily answered, by 

Lemma 3.2. If $\nu_1(G_i) = a_i$, then the expected number of turns before
switching is $1/(1 - a_i)$.

Proof. Let $\mathcal{P}_k$ be the probability of switching on the $k$-th
step or earlier. Then we have the equation:

$$\mathcal{P}_{k+1} = \mathcal{P}_k + (1 - \mathcal{P}_k)(1 - a_i) = a_i \mathcal{P}_k + (1 - a_i). \tag{12}$$

Since $\mathcal{P}_0 = 0$, it is easy to check that
$\mathcal{P}_j = 1 - a_i^j$. If $p_k$ is the probability of switching on the
$k$-th turn, then $p_k = a_i^{k-1} - a_i^k$, and the expected time of
switching is

$$\sum_{j=1}^{\infty} j \left( a_i^{j-1} - a_i^j \right) = \sum_{j=0}^{\infty} a_i^j = \frac{1}{1 - a_i}, \tag{13}$$

the first equality being obtained by telescoping the sum. □
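
Both steps of the proof are easy to confirm numerically for a single
overlap value. The sketch below iterates recursion (12) and sums a
truncated version of the series (13); the value $a = 0.8$, the truncation
length and the function names are arbitrary illustrative choices.

```python
def switch_by(a, k):
    """Iterate recursion (12) from P_0 = 0 to get P_k."""
    P = 0.0
    for _ in range(k):
        P = a * P + (1.0 - a)
    return P

def expected_switch_time(a, terms=10_000):
    """Truncated version of the series in (13): sum of j * (a^(j-1) - a^j)."""
    return sum(j * (a ** (j - 1) - a ** j) for j in range(1, terms + 1))

a = 0.8
print(switch_by(a, 7), 1 - a ** 7)             # closed form P_j = 1 - a^j
print(expected_switch_time(a), 1 / (1 - a))    # expected time 1 / (1 - a)
```

In other words, the time spent on a wrong set is geometrically
distributed with success probability $1 - a_i$.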

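From the two lemmas, it follows that given the probabilities
$a_2, \dots, a_n$, the expected time taken by the improved learner is

$$T = \frac{1}{n} \sum_{k=1}^{n-1} \frac{1}{\binom{n-1}{k}} \sum_{S_k} \sum_{i \in S_k} \frac{1}{1 - a_i},$$

where the middle summation is over all subsets $S_k$ of $\{2, \dots, n\}$
which have size $k$. Since for any $i$, the number of subsets of
$\{2, \dots, n\}$ of size $k$ containing $i$ equals $\binom{n-2}{k-1}$, the
above expression can be rewritten as

$$\begin{aligned}
T &= \frac{1}{n} \sum_{k=1}^{n-1} \frac{\binom{n-2}{k-1}}{\binom{n-1}{k}} \sum_{i=2}^{n} \frac{1}{1 - a_i}
   = \sum_{i=2}^{n} \frac{1}{1 - a_i} \sum_{k=1}^{n-1} \frac{k}{n(n-1)} \\
  &= \frac{1}{2} \sum_{i=2}^{n} \frac{1}{1 - a_i} = \frac{n - 1}{2 H_{n-1}},
\end{aligned} \tag{14}$$

where $H_{n-1}$ is defined in (4) with $x_i = 1 - a_i$. These computations
can be easily adapted to solve the following problem: suppose that we
want to be $1 - \Delta$ sure of getting to the right answer. How many steps
do we need? Notice that we will need to take $(1 - \Delta)n$ jumps, so the
computation as above gives us:

$$\begin{aligned}
N_\Delta &\approx \frac{1}{n} \sum_{k=1}^{(1-\Delta)n} \frac{\binom{n-2}{k-1}}{\binom{n-1}{k}} \sum_{i=2}^{n} \frac{1}{1 - a_i}
   = \sum_{i=2}^{n} \frac{1}{1 - a_i} \sum_{k=1}^{(1-\Delta)n} \frac{k}{n(n-1)} \\
  &\approx \frac{n^2 (1 - \Delta)^2}{2 n (n - 1)} \sum_{i=2}^{n} \frac{1}{1 - a_i}
   = \frac{n (1 - \Delta)^2}{2 H_{n-1}}.
\end{aligned} \tag{15}$$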



Comparing this with equation (10) and using estimate (5), we notice
that for every fixed $\Delta < 1$, this is only a constant factor better
than a memoryless learner. The constant is a function of $\Delta$, and
behaves as $\sim |\log \Delta|$, so goes to infinity (albeit slowly) as
$\Delta$ approaches 0.
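
The closed form for the expected learning time with full memory, half the
sum of $1/(1 - a_i)$ over the wrong sets, can be compared with a direct
simulation. The sketch below is an illustration under stated assumptions:
the overlap values, seed, number of trials and function names are
hypothetical, and "time" counts only the turns spent on wrong sets before
the correct set is reached.

```python
import numpy as np

def expected_time_full_memory(a):
    """Closed form: half the sum of 1/(1 - a_i) over the wrong sets i >= 2."""
    a = np.asarray(a, dtype=float)
    return 0.5 * np.sum(1.0 / (1.0 - a[1:]))

def simulate_full_memory(a, trials, rng):
    """Average number of turns spent on wrong sets before reaching set 1."""
    a = np.asarray(a, dtype=float)
    n = a.size
    keys = rng.random((trials, n))               # sorting keys = uniform random order
    visited_first = keys[:, 1:] < keys[:, [0]]   # wrong set i tried before set 1
    # Turns until a wrong set is rejected: geometric with success prob 1 - a_i.
    stay = rng.geometric(1.0 - a[1:], size=(trials, n - 1))
    return np.mean(np.sum(stay * visited_first, axis=1))

a = [1.0, 0.2, 0.5, 0.7, 0.9]                    # hypothetical overlaps, a[0] = 1
rng = np.random.default_rng(1)                   # fixed seed for reproducibility
print(simulate_full_memory(a, 200_000, rng), expected_time_full_memory(a))
```

Ordering the sets by independent uniform keys is one standard way to draw
a uniformly random visiting order, which is the symmetry used in Lemma 3.1.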